# Define path to the updated locations Delta table
locations_sdf_updated_one <- spark_read_delta(
sc,
path = file.path(
getwd(),
"data",
"locations_sdf_updated_one"
)
) |>
filter(trip_id > 40000000) %>%
sdf_repartition(partitions = 24) %>% # Repartition for better parallel processing
sdf_register("locations_sdf_updated_one_view") # Register as a temporary view4 Part One - Processing Raster Data with Apache Sedona and Sparklyr in R
Using Raster Data to Determine Population Density Around Pickup and Dropoff Points
4.1 Introduction
We are going to demonstrate how to obtain information from raster files based on coordinates in this chapter. Let us assume that there is a relationship between the duration of a taxi ride and the population density of an area. We will, therefore, need to extract population density values at each pickup and dropoff location. Such granular data is typically available in raster format, which is why we use WorldPop population density data with a resolution of 1 km by 1 km for this purpose. You can download the data here to follow along.
The Spark configuration used for this chapter — and the next one — is identical to that used in Chapter 2, so it has not been included below for brevity’s sake.
Furthermore, because I need to re-run this code multiple times to render this website, I shall filter only a few rows from the 94 million we previously worked with, as I will be constantly updating this site. In the actual analysis I conducted, however, I used the full dataset.
You can find out more about using Apache Sedona for raster manipulation here.
4.2 Loading updated locations data
We start by loading our updated locations data, which now contains household median income by neighbourhood information.
To reiterate, I will filter for about 14 million rows here so that I can render this webpage faster. However, sometimes in your analysis, you will find that you have too much data and limited memory, especially when running complex transformations.
In such cases, you can filter for specific rows, perform your analysis on that subset, and then append the results to your delta tables. You can repeat this process for another set of rows until you are done.
For instance, knowing that I have trip ID values ranging from 0 to about 48,000,000, I would:
- First filter for rows between 0 and 16 million,
- Then 16 million to 32 million,
- And finally, anything above 32 million,
- Appending to the same folder each time.
If you have enough RAM and cores, though, feel free to run everything at once — go crazy with it!
# Check the size of each partition in the dataframe
locations_sdf_updated_one %>% sdf_partition_sizes()# Check the number of rows in the dataframe
locations_sdf_updated_one |>
sdf_nrow()print(locations_sdf_updated_one, width=Inf)4.3 Loading WorldPop Population Density dataset
The difference when using raster data compared to vector data with Apache Sedona is that we do not import raster in its native format directly. Instead, we must first load it as a binary dataframe, and then convert it into its native raster format within Sedona.
Also, bear in mind that Sedona only accepts raster files in the following formats:
- Arc Info ASCII Grid,
- GeoTIFF, and
- NetCDF.
If your data is in any other raster format, you will first need to convert it to one of these supported formats.
I have found GDAL to be particularly useful for converting between different raster formats — definitely a tool to keep in your geospatial toolbox!
# Load the raster data for world population (NYC)
world_pop_raster_filepath <- file.path(
getwd(),
"data",
"raster",
"worldpop",
"nyc.tif"
)# Read the raster data as a binary file
world_pop_binary <- spark_read_binary(
sc,
dir = world_pop_raster_filepath,
name = "worldpop"
)We obtain raster geometry from our GeoTiff data.
# Register the world population raster as a temporary view
world_pop_binary |> sdf_register("worldpop_view")
# Extract raster data from the GeoTiff file using Sedona
worldpop_sdf <- sdf_sql(
sc,
"
SELECT RS_FromGeoTiff(content) AS raster FROM worldpop_view
"
)
# Register the raster data as a temporary view
worldpop_sdf |> sdf_register("worldpop_view") |> compute()
worldpop_sdf %>% glimpse()We can retrieve metadata from our raster file, including:
- The upper left coordinates of the raster (in the raster’s coordinate system units),
- The width and height of the raster (in number of pixels),
- The spatial resolution of each pixel (in units of the raster’s CRS),
- Any skew or rotation of the raster (if present),
- The SRID (spatial reference system identifier) of the raster’s coordinate system,
- The number of bands, and
- Tile width and height.
In our case:
- Upper left X coordinate: -74.25125
- Upper left Y coordinate: 40.90792 (both in degrees as the CRS is WGS84)
- Raster size: 66 x 49 pixels (quite small)
- Pixel resolution: 0.00833 x -0.00833 degrees
- Skew: 0 in both x and y directions (i.e., no skew)
- SRID: 4326 (WGS 84)
- Number of bands: 2
- Tile width: 66, Tile height: 15
All this information is important when interpreting and working with raster data, especially when performing coordinate-based queries.
# Retrieve and view metadata for the world population raster
worldpop_sdf_metadata <- sdf_sql(
sc,
"
SELECT RS_MetaData(raster) FROM worldpop_view
"
) |> collect()# Glimpse at the metadata information
worldpop_sdf_metadata |> glimpse()4.4 Joining point data with raster data
We now conduct the join using Spatial SQL, as it is much easier and more intuitive than using Apache Sedona’s R functions for raster operations.
By leveraging Spatial SQL, we can directly query raster values at specific pickup and dropoff coordinates, simplifying what would otherwise be a more complex process if done via function-based syntax.
# Perform a spatial join between the locations and the world population data to calculate population density
locations_sdf_updated_two <- sdf_sql(
sc,
"
SELECT
/*+ BROADCAST(w) */ l.*, RS_Value(w.raster, ST_Point(l.longitude, l.latitude)) AS pop_density
FROM
locations_sdf_updated_one_view l
LEFT JOIN worldpop_view w
ON RS_Intersects(w.raster, ST_POINT(l.longitude, l.latitude))
"
) We can now take a look at the result of our join below.
# Glimpse at the updated data with population density
locations_sdf_updated_two %>% glimpse()
# Print a preview of the resulting dataframe with specific formatting options
withr::with_options(
list(pillar.sigfig = 11),
print(locations_sdf_updated_two, n=30)
)4.5 Saving the data
And finally save the data to file.
# Define file path for saving the updated dataframe
locations_sdf_updated_two_file_path <- file.path(
getwd(),
"data",
"locations_sdf_updated_two"
)
# Save the final dataframe as a Delta table
spark_write_delta(
locations_sdf_updated_two,
path = locations_sdf_updated_two_file_path,
mode = "append" # Overwrite any existing data at the location
)# Disconnect from the Spark session
spark_disconnect(sc)